Conversation

rhshadrach (Member):
This takes a bit of a perf hit with SeriesGroupBy.value_counts; I think this is because the current implementation always sorts the groupers (which is also fixed here).

# group boundaries are where group ids change
idchanges = 1 + np.nonzero(ids[1:] != ids[:-1])[0]
idx = np.r_[0, idchanges]
if not len(ids):
    idx = idchanges
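As a small illustration (not part of the PR itself), the boundary computation above can be checked on a toy sorted array of group ids; `ids` here stands in for the integer codes produced by the grouper:

```python
import numpy as np

# Sorted group ids, as the grouper would produce after factorization.
ids = np.array([0, 0, 1, 1, 1, 2])

# Group boundaries are the positions where the id changes.
idchanges = 1 + np.nonzero(ids[1:] != ids[:-1])[0]
idx = np.r_[0, idchanges]
if not len(ids):
    # An empty input has no leading boundary at position 0.
    idx = idchanges

print(idx.tolist())  # → [0, 2, 5]: start positions of the three groups
```

The `len(ids) == 0` branch avoids emitting a spurious boundary at position 0 for an empty input.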

We also get a good perf improvement with categorical dtypes.
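For context, here is a minimal sketch of the `sort=False` behavior the linked bug report is about (the toy frame and column names are made up for illustration): with this fix, `SeriesGroupBy.value_counts(sort=False)` no longer force-sorts each group's counts.

```python
import pandas as pd

# Hypothetical toy frame; names are illustrative only.
df = pd.DataFrame({"g": ["a", "a", "a", "b"], "x": ["y", "z", "z", "y"]})

# With sort=False, the per-group counts are no longer sorted by count.
counts = df.groupby("g")["x"].value_counts(sort=False)
print(counts)

# Regardless of ordering, the counts always sum to the number of rows.
assert counts.sum() == len(df)
```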

Perf code
import numpy as np
import pandas as pd

size = 1000
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]
col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)
col3_values = np.random.choice(col2_possible_values, size=size, replace=True)
df = pd.DataFrame(zip(col1_values, col2_values, col3_values), columns=["col1", "col2", "col3"])

print('DataFrameGroupBy - object')
%timeit df.groupby("col1").value_counts()

print('SeriesGroupBy - object')
%timeit df.groupby("col1")["col2"].value_counts()

df['col2'] = df['col2'].astype('category')

print('DataFrameGroupBy - category')
%timeit df.groupby("col1", observed=True)[["col2"]].value_counts()

print('SeriesGroupBy - category')
%timeit df.groupby("col1", observed=True)["col2"].value_counts()
# main
DataFrameGroupBy - object
4.09 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
SeriesGroupBy - object
1.32 ms ± 7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
DataFrameGroupBy - category
86.1 ms ± 611 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
SeriesGroupBy - category
654 ms ± 7.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# This PR
DataFrameGroupBy - object
4.1 ms ± 53.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
SeriesGroupBy - object
2 ms ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
DataFrameGroupBy - category
84.7 ms ± 576 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
SeriesGroupBy - category
83.1 ms ± 565 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@rhshadrach rhshadrach added Bug Groupby Performance Memory or execution speed performance Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Jan 4, 2023
@rhshadrach rhshadrach added this to the 2.0 milestone Jan 4, 2023
mroeschke (Member) left a comment:
Looks fairly good; just a merge conflict to resolve.

mroeschke (Member) left a comment:

LGTM after the merge conflicts are resolved

…pby_value_counts_sort

# Conflicts:
#	doc/source/whatsnew/v2.0.0.rst
#	pandas/tests/groupby/test_value_counts.py
…rach/pandas into groupby_value_counts_sort

# Conflicts:
#	doc/source/whatsnew/v2.0.0.rst
rhshadrach (Member, Author):

Merging to avoid any further whatsnew conflicts

@rhshadrach rhshadrach merged commit 4f42ecb into pandas-dev:main Jan 11, 2023
@rhshadrach rhshadrach deleted the groupby_value_counts_sort branch January 11, 2023 02:22
Successfully merging this pull request may close these issues.

BUG: SeriesGroupBy.value_counts sorts when sort=False
BUG: very slow groupby(col1)[col2].value_counts() for columns of type 'category'